arXiv:1509.03956v1 [cs.CV] 14 Sep 2015


Learning to Divide and Conquer for Online Multi-Target Tracking 


Francesco Solera Simone Calderara Rita Cucchiara 

Department of Engineering 
University of Modena and Reggio Emilia 

name.surname@unimore.it 


Abstract 

Online Multiple Target Tracking (MTT) is often addressed
within the tracking-by-detection paradigm. Detections are
first extracted independently in each frame and then
object trajectories are built by maximizing specifically
designed coherence functions. Nevertheless, ambiguities arise
in the presence of occlusions or detection errors. In this paper
we claim that the ambiguities in tracking could be solved by
a selective use of the features, by working with more reliable
features if possible and exploiting a deeper representation of
the target only if necessary. To this end, we propose an online
divide and conquer tracker for static camera scenes, which
partitions the assignment problem into local subproblems and
solves them by selectively choosing and combining the best
features. The complete framework is cast as a structural
learning task that unifies these phases and learns tracker
parameters from examples. Experiments on two different
datasets highlight a significant improvement in tracking
performance (MOTA +10%) over the state of the art.

1. Introduction 

Multiple Target Tracking (MTT) is the task of extracting
the continuous path of relevant objects across a set of
subsequent frames. Due to the recent advances in object
detection [10, 4], the problem of MTT is often addressed
within the tracking-by-detection paradigm. Detections are
first extracted independently in each frame and then
object trajectories are built by maximizing specifically
designed coherence functions [19, 5, 21, 2, 9, 24]. Tracking
objects through detections can mitigate the drifting behaviors
introduced by prediction steps but, on the other hand, it forces
the tracker to work in adverse conditions, due to the frequent
occurrence of false and missed detections.

The majority of approaches address MTT offline, i.e. by
exploiting detections from a set of frames [19, 5, 9] through
global optimization. Offline methods benefit from the larger
portion of the video sequence available to them to establish
spatio-temporal coherence, but cannot be used in real-time
applications. Conversely, online methods track the targets
frame-by-frame; they have a wider range of applications
but must be both accurate and fast despite working with less
data. In this context, the robustness of the features plays a
major role in the online MTT task. Some approaches claim
the adoption of complex target models [2, 24] to be the
solution, while others argue that this complexity may affect
the long-term robustness [23]. For instance, in large crowds
people's appearance is rarely informative. As a consequence,
tracking robustness is often achieved by focusing on spatial
features [21], finding them more reliable than visual ones.

Figure 1: The scene is partitioned into local zones. Green zones are
those where the same number of tracks and detections is present. Red
zones are those where missed and false detections (white dashed
contours) are discovered, and solving the associations may call for
complex appearance or motion features.

We believe that many of the ambiguities in tracking
could be solved by a selective use of the features, by working
with more reliable features if possible and exploiting a deeper
representation of the target only if necessary. In fact, a
simple spatial association is often sufficient while, as clutter
or confusion arises, an improved association scheme based on
more complex features is needed (Fig. 1).

In this paper a novel approach for online MTT in static
camera scenes is proposed. The method selects the most
suitable features to solve the frame-by-frame associations
depending on the surrounding scene complexity.
Specifically, our contributions are:

• an online method based on Correlation Clustering that 
learns to divide the global association task in smaller 
and localized association subproblems (Sec. 5), 

• a novel extension to the Hungarian association scheme,
flexible enough to be applied to any set of preferred
features and able to conquer trivial and complex
subproblems by selectively combining the features (Sec. 6),

• an online Latent Structural SVM (LSSVM) framework 
to combine the divide and conquer steps and to learn 
from examples all the tracker parameters (Sec. 7). 

The algorithm works by alternating between (a) learning the
affinity measure of the Correlation Clustering as a latent
variable and (b) learning the optimal combinations for both
simple and complex features to be used as cost functions by
the Hungarian algorithm. Results on public benchmarks underline
a major improvement in tracking accuracy over current state
of the art online trackers (+10% MOTA).

The work takes inspiration from human perceptual behavior,
further introduced in Sec. 3. According to the widely
accepted two-streams hypothesis by Goodale and Milner [12],
the use of motion and appearance information is localized in
the temporal lobe (what pathway), while basic spatial cues
are processed in the parietal lobe (where pathway). This
suggests our brain processes and exploits information in
different and specific ways as well.

2. Related works 

Tab. 1 reports an overview of recent tracking-by-detection
approaches in the literature, separating online and offline methods,
and indicates the adoption of tracklets (T), appearance
models (A) and complex learning schemes (L). Offline
methods [5, 19, 13, 18] are out of the scope of the paper and are
reported for the sake of completeness.

Tracklets are the result of an intermediate hierarchical
association of the detections and are commonly used by both
offline and online solutions [18, 13, 25]. In these
approaches, high confidence associations link detections in
a pre-processing step and then optimization techniques are
employed to link tracklets into trajectories. Nevertheless,
tracklet creation involves solving a frame-by-frame
assignment problem by thresholding the final association cost,
and errors in the tracklets affect the tracking results as well.

Table 1: Overview of offline and online related works in terms of
code availability (C), appearance models (A), tracklets computation
(T), associations learning (L) and presence in the MOT Challenge
competition (M). In our method, appearance is marked as used only
when needed.

In addition, online methods often try to compensate for the lack
of spatio-temporal information through the use of appearance
or other complex feature models. The appearance model is
typically handled by the adoption of a classifier for each tracked
target [24] and data association is often finalized through
an averaged sum of classifier scores [7, 2]. As a
consequence, learning is on the target models, not on associations.
Moreover, online methods also need to cope with drifting
when updating their target models. One possible solution is
to avoid model updating when uncertainties are detected in
the data, i.e. when a detection cannot be paired to a sufficiently
close previous trajectory [2]. Nevertheless, any error
introduced into the model can rapidly lead to tracking drift and
wrong appearance learning. Building on these
considerations, Possegger et al. [21] do not consider appearance at
all and only work with distance and motion data.

Unlike the aforementioned online learning
methods, our approach is not hierarchical and we do not compute
intermediate tracklets because errors in the tracklets corrupt
the learning data. Similarly to [2], we model a score of
uncertainty, but based on distance information only and not
on the target model, since distance cannot drift over time.
This enables us to invoke appearance and other less stable
features only when truly needed, as in the case of missing
detections, occluded objects or new tracks.

3. Related perception studies

The proposed method is inspired by the human cognitive
ability to solve the tracking task. In fact, events such as eye
movements, blinks and occlusions disrupt the input to our
vision system, introducing challenges similar to the ones
encountered in real world video sequences and detections.
Perception psychologists have studied the mechanisms
employed in our brain during multiple object tracking since the
’80s [14, 22, 1], though only recently MRI experiments have
been used to confirm and validate proposed theories. One
of these preeminent theories is given in a seminal work by
Kahneman, Treisman and Gibbs in 1992 [14]. They
proposed the theory of Object Files to understand the dominant
role of spatial information in preserving target identity. The
theory highlights the central role of spatial information in
a paradigm called Spatio-Temporal Dominance. Accordingly,
target correspondence is computed on the basis of
spatio-temporal continuity and does not consult non-spatial
properties of the target. If spatio-temporal information is
consistent with the interpretation of a continuous target, the
correspondence will be established even if appearance
features are inconsistent. “Up in the sky, look: It’s a bird. It’s
a plane. It’s Superman!” - this well known quote, from the
respective short animated movie (1941), suggests that the
people pointing at Superman changed their visual perception
of the target to the extent of giving him a completely different
meaning, while they never had any doubt they kept
referring to the same object. Nevertheless, when correspondence
cannot be firmly established on the basis of spatial
information, appearance, motion, and other complex features can be
consulted as well. In particular, in [14] the tracking process
is divided into a circular pipeline of three steps (Fig. 2, top
row). The correspondence step uses only positional information
and aims at establishing whether each detected object is a new
target or an existing one appearing at a different location.
The review step activates when ambiguity in assignments arises,
and recomputes uncertain target links by also taking into
account more complex features. Finally, the impletion step
assesses and induces the perception of the temporal coherence
of the targets.

Figure 2: First row shows the human tracking process according to
the theory of Kahneman, Treisman and Gibbs [14]. Below, a schematic
view of the inference and learning steps underpinning our method.

4. The proposal 

As depicted in Fig. 2, the proposed method relates the 3
steps of correspondence, review and impletion to a divide
and conquer approach. Targets are divided in the where
pathway by checking for incongruences in spatial coherence.
Finally, the tracking solution is conquered by associating
coherent elements in the where (spatial) domain and
incoherent ones in the what (visual) domain.

The core of the proposal is twofold. First, a method to
divide potential associations between detections and tracks
into local clusters or zones. A zone can be either simple
or complex, calling for different features to complete the
association. Targets can be directly associated to their closest
detections if they are inside a simple zone (e.g. when we
have the same number of tracks and detections, green area
in Fig. 3b). Conversely, targets inside complex areas (red in
Fig. 3b) are subject to a deeper evaluation where appearance,
motion and other features may be involved.
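
As a minimal sketch of the simple/complex distinction, assuming zones are already available as sets of track and detection identifiers, the test reduces to a cardinality check. The `Zone` container and the helper names are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Zone:
    """A local cluster of tracks and detections (hypothetical container)."""
    tracks: set = field(default_factory=set)
    detections: set = field(default_factory=set)


def is_simple(zone: Zone) -> bool:
    # A zone is simple when it holds as many tracks as detections;
    # any mismatch (missed or false detections) makes it complex.
    return len(zone.tracks) == len(zone.detections)


def split_zones(zones):
    """Separate zones into (simple, complex) lists."""
    simple = [z for z in zones if is_simple(z)]
    complex_ = [z for z in zones if not is_simple(z)]
    return simple, complex_
```

Simple zones would then be solved on distances alone, while complex ones would be passed on to the richer feature set.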

Second, we cast the problems of splitting potential
associations and solving them by selecting and weighting the
features inside a unified structural learning framework that
aims at the best set of partitions and adapts from scene to
scene.

4.1. Problem formulation 


Online MTT is typically solved by optimizing, at frame
$k$, a generic assignment function for a set of tracks $\mathcal{T}$ and
current detections $\mathcal{D}_k$:

$$h(\mathcal{T}, \mathcal{D}_k) = \arg\min_{\mathbf{y}} \sum_{i=1}^{n} C(i, y^i), \qquad (1)$$

where $\mathbf{y}$ is a permutation vector of $\{1, 2, \dots, n\}$ and
$C \in \mathbb{R}^{n \times n}$ is a cost matrix. The cost matrix $C$ is designed to
include dummy rows and columns to account for newly detected
objects ($D_{in}$) or leaving targets ($T_{out}$). More formally,
if matrix $A : \mathcal{T} \times \mathcal{D}_k \to \mathbb{R}$ contains association costs for
currently tracked targets and detections, the cost matrix is:

$$C = \begin{bmatrix} A & T_{out} \\ D_{in} & \Xi \end{bmatrix} \qquad (2)$$

where $D_{in}$, $T_{out}$ contain the cost $\xi$ of creating a new track
on the diagonal and $+\infty$ elsewhere. Similarly, $\Xi$ is a full
matrix of value $\xi$.
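
The block structure of Eq. (2) can be illustrated with a short sketch. It assumes `A` is a plain list of lists with one row per track and one column per detection, and it uses a brute-force search over permutations in place of the Hungarian algorithm, which is only viable for tiny examples. The function names are hypothetical.

```python
from itertools import permutations

INF = float("inf")


def build_cost_matrix(A, xi):
    """Assemble C = [[A, T_out], [D_in, Xi]] from an n_t x n_d cost matrix A."""
    n_t, n_d = len(A), len(A[0])
    n = n_t + n_d
    C = [[INF] * n for _ in range(n)]
    for i in range(n_t):
        for j in range(n_d):
            C[i][j] = A[i][j]          # track i <-> detection j
        C[i][n_d + i] = xi             # T_out: track i leaves the scene
    for j in range(n_d):
        C[n_t + j][j] = xi             # D_in: detection j starts a new track
        for i in range(n_t):
            C[n_t + j][n_d + i] = xi   # Xi: full block pairing dummy entries
    return C


def solve_assignment(C):
    """Brute-force minimiser of sum_i C[i][y[i]] over permutations y (tiny n only)."""
    n = len(C)
    return min(permutations(range(n)),
               key=lambda y: sum(C[i][y[i]] for i in range(n)))
```

With one track and two detections, for instance, the second detection is matched from its dummy row, which corresponds to opening a new track at cost $\xi$.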

The formulation in Eq. (1) evaluates all the associations 
through the same cost function, built upon a preferred set 
of features. In order to consider different cost functions for 
specific subsets of associations, we reformulate Eq. (1) as: 


$$h(\mathcal{T}, \mathcal{D}_k) = \arg\min_{\mathbf{y}} \sum_{\substack{(i, y^i) \in z \\ z \in \mathcal{Z}_s}} C_s(i, y^i) + \sum_{\substack{(i, y^i) \in z \\ z \in \mathcal{Z}_c}} C_c(i, y^i) \qquad (3)$$

where we make explicit the different contributions of trivial and
difficult associations, whose costs are given by the functions
$C_s$ and $C_c$ respectively. Associations are locally partitioned
in zones $z \in \mathcal{Z}$ as shown in Fig. 3b. Hereinafter, we
interchangeably refer to a zone $z$ as a portion of the scene or as
the set of detections and tracks that lie in it. A zone can be simple
($z \in \mathcal{Z}_s$) or complex to solve ($z \in \mathcal{Z}_c$) depending on the set of
associations it involves.

Figure 3: Overview of the inference procedure. (a) In the image, targets are represented by bird's-eye view sketches (shaded when occluded)
and detections by crosses. (b) In the divide step, detections and non-occluded targets are spatially clustered into zones. A zone with an equal
number of targets and detections is simple (solid green contours), complex otherwise (dashed red contours). (c) Associations in simple zones
are independently solved by means of distance features only. Complex zones are solved by considering more complex features such as
appearance or motion and accounting for potentially occluded targets, which are shared across all the complex zones.


5. Learning to divide 

In this section, we propose a method to generate zones $z$
and decide whether associations in those zones are simple
($z \in \mathcal{Z}_s$) or difficult ($z \in \mathcal{Z}_c$). A zone $z$ can be defined as a
heterogeneous set of tracks and detections characterized by
spatial proximity. Even if simple, the concept of proximity
may vary across sequences, and the importance of distances
on each axis depends on the dominant flows of targets in the scene.
Zones are computed through the Correlation Clustering (CC)
method [3] on the cost matrix $A$ suitably modified to obtain
an affinity matrix $\tilde{A}$ as required by the CC algorithm. To
move from cost features (distances) in $A$ to affinity features
in $\tilde{A}$, the cost feature vector is augmented with its
similarity counterpart and the affinity value is computed as the
scalar product between this vector and a parameter vector $\boldsymbol{\theta}$:


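$$\tilde{A}(i,j) = \boldsymbol{\theta}^{\top} \big( \underbrace{|t_i^x - d_j^x|,\ |t_i^y - d_j^y|}_{\text{cost features}},\ \underbrace{1 - |t_i^x - d_j^x|,\ 1 - |t_i^y - d_j^y|}_{\text{similarity features}} \big)^{\top} \qquad (4)$$

where $t_i$ and $d_j$ are the $i$-th track and $j$-th detection
respectively. The $\boldsymbol{\theta}$ vector has the triple advantage of weighting
distances on each axis differently, avoiding the need to set thresholds
in the affinity computation, and controlling the compactness
and the balancing of clusters. Further details on learning $\boldsymbol{\theta}$
are provided in the following sections.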
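
The affinity between a track and a detection can be sketched as a plain scalar product. The example assumes positions are normalised to $[0, 1]$, so that one minus a distance is a meaningful similarity. The function name and the weight values used below are illustrative and not taken from the paper.

```python
def affinity(track, det, theta):
    """Affinity between a track and a detection as a scalar product.

    track, det: (x, y) positions, assumed normalised to [0, 1] so that
    1 - distance is a meaningful similarity (an assumption of this sketch).
    theta: four weights, two for the cost features and two for the
    similarity features.
    """
    dx = abs(track[0] - det[0])
    dy = abs(track[1] - det[1])
    features = (dx, dy, 1.0 - dx, 1.0 - dy)
    return sum(t * f for t, f in zip(theta, features))
```

With negative weights on the cost features and positive weights on the similarity features, nearby pairs receive a positive affinity and distant pairs a negative one, so no explicit threshold is needed.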

To prevent the creation of clusters composed only of
detections or tracks, a symmetric version of $\tilde{A}$ is created
having a zero block diagonal structure:

$$A_{sym} = \begin{bmatrix} 0 & \tilde{A} \\ \tilde{A}^{\top} & 0 \end{bmatrix} \qquad (5)$$

With this arrangement, two tracks (detections) can be in
the same cluster only if they are close to a common detection (track).
The CC algorithm, applied on $A_{sym}$, efficiently partitions the
scene into a set of zones $\mathcal{Z}$ so that the sum of the affinities
between track-detection pairs in the same zone is maximized:

$$\arg\max_{\mathcal{Z}} \sum_{z \in \mathcal{Z}} \sum_{(i,j) \in z} A_{sym}(i,j). \qquad (6)$$

Finally, a zone $z$ is defined as simple if it contains an
equal number of targets and detections; otherwise it is
complex. As previously stated, associations in a complex zone
$z \in \mathcal{Z}_c$ cannot be solved with the use of distance
information only (Fig. 3b), but require more informative features to
disambiguate the decision.

6. Learning to conquer

The divide mechanism brings the advantage of splitting
the problem into smaller local subproblems. Associations
belonging to simple zones can be independently solved through
any bipartite matching algorithm. The complete tracking
problem must also deal with occluded targets. We
consider a target as occluded when it is not associated to a
detection (e.g. a missed detection in frame $k$ occurred, shaded
people in Fig. 3). Since occluded targets are representations
of disappeared objects, they are not included in the zones at
the current frame. All the subproblems related to complex
zones $z \in \mathcal{Z}_c$ are consequently connected by sharing the
whole set of occluded targets. In order to simultaneously
solve the whole set of subproblems, we construct an
augmented version of the matrix in Eq. (2) where the block $H$
accounts for potential associations between occluded tracks
and current detections:

$$C = \begin{bmatrix} A & T_{out} & +\infty \\ H & +\infty & H_{occ} \\ D_{in} & \Xi & \Xi \end{bmatrix} \qquad (7)$$

$H_{occ}$ is a $\xi$-diagonal matrix ($+\infty$ elsewhere) used to keep
occluded tracks still occluded in the current frame. The
solution of the optimization problem in Eq. (1) on matrix $C$,
obtained by applying the Hungarian algorithm, provides the
final tracking associations for this frame.

More precisely, thanks to the peculiar block structure
of $C$, a single call to the Hungarian algorithm solves the
partitioned association problem in Eq. (3), subject to the
constraint that each occluded element can be inserted in a
single complex zone subproblem solution. In $C$, simple
zone subproblems are isolated by setting the association
cost outside the zone to $+\infty$. Similarly, complex zones
result in independent blocks as well, but are connected
through the presence of occluded elements, i.e. non-infinite
entries in $H$.
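
A possible assembly of the augmented matrix of Eq. (7) is sketched below. The row and column ordering (visible tracks, occluded tracks, dummy rows; detections, leaving dummies, occlusion dummies) is an assumption of this illustration, as is the hypothetical function name. The sketch also assumes at least one visible track, so that the number of detections can be read from `A`.

```python
INF = float("inf")


def build_augmented_cost(A, H, xi):
    """Assemble the block matrix [[A, T_out, inf], [H, inf, H_occ], [D_in, Xi, Xi]].

    A: n_t x n_d costs between visible tracks and detections.
    H: n_o x n_d costs between occluded tracks and detections.
    """
    n_t, n_o, n_d = len(A), len(H), len(A[0])
    n = n_t + n_o + n_d
    C = [[INF] * n for _ in range(n)]
    for i in range(n_t):
        C[i][:n_d] = A[i]
        C[i][n_d + i] = xi                    # T_out diagonal: track leaves
    for o in range(n_o):
        C[n_t + o][:n_d] = H[o]
        C[n_t + o][n_d + n_t + o] = xi        # H_occ diagonal: stay occluded
    for j in range(n_d):
        row = n_t + n_o + j
        C[row][j] = xi                        # D_in diagonal: new track
        for c in range(n_d, n):
            C[row][c] = xi                    # the two full Xi blocks
    return C
```

Cross-zone entries of `A` and `H` would additionally be set to infinity before the call, so that a single assignment run solves all zones at once.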

By casting the problem using the cost matrix $C$, it is possible
to learn, in a joint framework, to combine features in order to
obtain a suitable cost for both the association (either in
simple or complex zones) and the partition in zones as well. To
this end we introduce a linear $\mathbf{w}$-parametrization on $A$ and
$H$ with a mask vector $\boldsymbol{\pi}_z$ that selects the features according
to the complexity of the zone $z$ they belong to:

$$C_{\mathbf{w}}(i,j) = \mathbf{w}^{\top} \big( \boldsymbol{\pi}_z(i,j) \circ \mathbf{f}(i,j) \big), \qquad (8)$$

$\circ$ being the Hadamard product. The feature vector contains
both simple and complex information between the $i$-th track
and the $j$-th detection:

$$\mathbf{f}(i,j) = \big( 1,\ \underbrace{\underbrace{|t_i^x - d_j^x|,\ |t_i^y - d_j^y|}_{\text{features for } z \in \mathcal{Z}_s},\ 1 - |t_i^x - d_j^x|,\ 1 - |t_i^y - d_j^y|}_{\text{features for divide step}},\ \underbrace{g_1(i,j),\ g_2(i,j),\ \dots}_{\text{features for } z \in \mathcal{Z}_c} \big)^{\top} \qquad (9)$$

where $g_1, g_2, \dots$ are distance functions between track $i$ and
detection $j$ on complex features 1 and 2 respectively.
Precisely, $\boldsymbol{\pi}_z$ selectively activates features according to the
following rules:

$$\boldsymbol{\pi}_z(i,j) = \begin{cases} (0,1,1,0,0,0,0,0,0,\dots)^{\top} & \text{if (a)} \\ (0,0,0,0,0,1,1,1,1,\dots)^{\top} & \text{if (b)} \\ (\infty,0,0,0,0,0,0,0,0,\dots)^{\top} & \text{if (c)} \end{cases} \qquad (10)$$

where the target-detection pair in $C_{\mathbf{w}}(i,j)$ may (a) belong
to the same simple zone, (b) be composed of elements
belonging to complex zones or (c) have elements belonging
to different zones.
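
The masking rules can be illustrated with a small sketch. It assumes the feature layout suggested by the masks: one leading bias slot (taken here to hold the constant 1), four spatial slots and then the complex features. The case labels and function names are hypothetical.

```python
INF = float("inf")


def mask(case, n_features):
    """Return the mask pi_z for a track-detection pair.

    Slot 0 is the bias slot, slots 1-4 hold the spatial features and
    slots 5+ the complex ones (layout assumed for this sketch).
    """
    pi = [0.0] * n_features
    if case == "same_simple_zone":        # (a): the two distance features
        pi[1] = pi[2] = 1.0
    elif case == "complex_zones":         # (b): complex features only
        for k in range(5, n_features):
            pi[k] = 1.0
    else:                                 # (c): different zones, forbidden
        pi[0] = INF
    return pi


def masked_cost(w, pi, f):
    """w^T (pi o f), skipping zero mask entries so that 0 * inf never occurs."""
    return sum(wk * pk * fk for wk, pk, fk in zip(w, pi, f) if pk != 0.0)
```

Case (c) yields an infinite cost as long as the bias weight is positive, which is what isolates zones from one another inside a single cost matrix.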

The feature vector $\mathbf{f}(i,j)$ is computed only on pairs of
(possibly occluded) tracks and detections. To extend the
parametrization to the whole matrix $C$, it is sufficient
to set $\boldsymbol{\pi}_z = (1,0,0,\dots)^{\top}$ outside $A$ and $H$.
Analogously, for elements $C(i,j)$ outside $A$ or $H$, we set
$\mathbf{f}(i,j) = (\infty,0,0,\dots)^{\top}$ and $\mathbf{f}(i,j) = (1,0,0,\dots)^{\top}$ when
$C(i,j) = +\infty$ and $C(i,j) = \xi$ respectively. The learning
procedure in Sec. 7 computes the best weight vector $\mathbf{w}$ and
consequently $\xi$ is learnt as a bias term. Recall that $\xi$ governs
track initiation and termination. Eq. (3) becomes a linear
combination of the weights $\mathbf{w}$ and a feature map $\Phi$:

$$h(\mathcal{T}, \mathcal{D}_k; \mathbf{w}) = \arg\max_{\mathbf{y}, \mathcal{Z}} -\mathbf{w}^{\top} \sum_{i=1}^{n} \boldsymbol{\pi}_z(i, y^i) \circ \mathbf{f}(i, y^i) = \arg\max_{\mathbf{y}, \mathcal{Z}} \mathbf{w}^{\top} \Phi(\mathcal{T}, \mathcal{D}_k, \mathbf{y}, \mathcal{Z}). \qquad (11)$$

The feature map $\Phi$ is a function evaluating how well the set
of zones $\mathcal{Z}$ and the proposed tracking solution $\mathbf{y}$ for frame $k$
fit the input data $\mathcal{T}$ and $\mathcal{D}_k$.

Given a set of weights $\mathbf{w}$, the tracking problem in Eq. (11)
can be solved by first computing the zones $\mathcal{Z}$ through the
divide step on matrix $A_{sym}$ of Eq. (5) and then by
conquering the associations in each zone through the
Hungarian method on matrix $C$. Note that now
$A_{sym}(i,j) = \mathbf{w}^{\top} \big( (0,1,1,1,1,0,0,\dots)^{\top} \circ \mathbf{f}(i,j) \big)$
and $\boldsymbol{\theta}$ is a subset of $\mathbf{w}$.

7. Online subgradient optimization 

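The problem of Eq. (11) requires identifying the complex
structured object $(\mathbf{y}, \mathcal{Z})$, an element of the joint space of
assignments and zone partitions, such that $\mathcal{Z}$ is the set
of zones that best explains the $k$-th frame tracking solution
$\mathbf{y}$ for an input $(\mathcal{T}, \mathcal{D}_k)$. Zones $z \in \mathcal{Z}$ are modelled as
latent variables, since they remain unobserved during training.
To this end, we learn the weight vector $\mathbf{w}$ in $h(\mathcal{T}, \mathcal{D}_k; \mathbf{w})$
through Latent Structural SVM [26] by solving the
following unconstrained optimization problem over the training set
$S = \{(\mathcal{T}, \mathcal{D}_k, \mathbf{y}_k)\}_{k=1 \dots K}$:

$$\min_{\mathbf{w}} \frac{\lambda}{2} \|\mathbf{w}\|^2 + \frac{1}{K} \sum_{k=1}^{K} H_k(\mathbf{w}), \qquad (12)$$

with $H_k(\mathbf{w})$ being the structured hinge-loss. $H_k(\mathbf{w})$ results
from solving the loss-augmented maximization problem

$$H_k(\mathbf{w}) = \max_{\mathbf{y}, \mathcal{Z}} H_k(\mathbf{y}, \mathcal{Z}; \mathbf{w}) = \max_{\mathbf{y}, \mathcal{Z}} \Delta_k(\mathbf{y}, \mathcal{Z}) - \langle \mathbf{w}, \psi_k(\mathbf{y}, \mathcal{Z}) \rangle, \qquad (13)$$

where $\Delta_k(\mathbf{y}, \mathcal{Z}) = \Delta(\mathbf{y}_k, \mathcal{Z}_k, \mathbf{y}, \mathcal{Z})$ is a loss function that
measures the error of predicting the output $\mathbf{y}$ instead of
the correct output $\mathbf{y}_k$ while assuming $\mathcal{Z}$ to hold instead
of $\mathcal{Z}_k$, and we defined $\psi_k(\mathbf{y}, \mathcal{Z}) = \Phi(\mathcal{T}, \mathcal{D}_k, \mathbf{y}_k, \mathcal{Z}_k) - \Phi(\mathcal{T}, \mathcal{D}_k, \mathbf{y}, \mathcal{Z})$
for notational convenience.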

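Algorithm 1 Block-Coordinate Primal-Dual Frank-Wolfe Algorithm for learning $\mathbf{w}$ on a sequence of $K$ frames

Let $\mathbf{w}^{(0)} \leftarrow \mathbf{0}$, $\mathbf{w}_k^{(0)} \leftarrow \mathbf{0}$, $\ell^{(0)} \leftarrow 0$, $\ell_k^{(0)} \leftarrow 0$ for $k = 1, \dots, K$
for $k \leftarrow 1$ to $K$ do
    Compute simple features for learning to divide, Eq. (9)
    Latent completion: $\mathcal{Z}_k = \arg\max_{\mathcal{Z}} \mathbf{w}^{\top} \Phi(\mathcal{T}, \mathcal{D}_k, \mathbf{y}_k, \mathcal{Z})$ through Correlation Clustering on $A_{sym}$ of Eq. (5)
    Compute complex features for learning to conquer, Eq. (9)
    Max Oracle: $(\hat{\mathbf{y}}_k, \hat{\mathcal{Z}}_k) = \arg\max_{\mathbf{y}, \mathcal{Z}} H_k(\mathbf{y}, \mathcal{Z}; \mathbf{w})$ through Hungarian on Eq. (14)
    Let $\mathbf{w}_s \leftarrow \frac{1}{\lambda K} \psi_k(\hat{\mathbf{y}}_k, \hat{\mathcal{Z}}_k)$ and $\ell_s \leftarrow \frac{1}{K} \Delta_k(\hat{\mathbf{y}}_k, \hat{\mathcal{Z}}_k)$
    Let $\gamma \leftarrow [\lambda (\mathbf{w}_k^{(k)} - \mathbf{w}_s)^{\top} \mathbf{w}^{(k)} - \ell_k^{(k)} + \ell_s] / [\lambda \|\mathbf{w}_k^{(k)} - \mathbf{w}_s\|^2]$ and clip to $[0,1]$
    Update $\mathbf{w}_k^{(k+1)} \leftarrow (1 - \gamma) \mathbf{w}_k^{(k)} + \gamma \mathbf{w}_s$ and $\ell_k^{(k+1)} \leftarrow (1 - \gamma) \ell_k^{(k)} + \gamma \ell_s$
    Update $\mathbf{w}^{(k+1)} \leftarrow \mathbf{w}^{(k)} + \mathbf{w}_k^{(k+1)} - \mathbf{w}_k^{(k)}$ and $\ell^{(k+1)} \leftarrow \ell^{(k)} + \ell_k^{(k+1)} - \ell_k^{(k)}$
end for

Figure 4: (a) Inference step. (b) Maximization oracle. Thanks to the
choice of the Hamming loss, the maximization oracle is reduced to an
assignment problem efficiently solved through the Hungarian
algorithm, as for the inference step.

Solving Eq. (13) is equivalent to finding the output-latent
pair $(\mathbf{y}, \mathcal{Z})$ generating the most violated constraint, for a
given input $(\mathcal{T}, \mathcal{D}_k)$ and a latent setting $\mathcal{Z}_k$. Despite the
generality of the learning framework, the loss function $\Delta$ is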
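problem dependent and must be accurately chosen. In
particular, we adopted the Hamming loss function that, substituted
in Eq. (13), behaves linearly, making the maximization oracle
solvable as a standard assignment problem (Fig. 4b):

$$H_k(\mathbf{w}) = \max_{\mathbf{y}, \mathcal{Z}} \sum_{i=1}^{n} \Big[ \mathbb{1}\{y_k^i \neq y^i\} - \mathbf{w}^{\top} \big( \boldsymbol{\pi}_z(i, y^i) \circ \mathbf{f}(i, y^i) \big) \Big] \qquad (14)$$

where $\Phi(\mathcal{T}, \mathcal{D}_k, \mathbf{y}_k, \mathcal{Z}_k)$ was dropped as it does not depend on
either $\mathbf{y}$ or $\mathcal{Z}$.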
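
The reduction of the maximization oracle to an assignment problem can be sketched as follows. Since inference minimises the association cost, maximising loss minus cost is the same as minimising a cost matrix in which every entry disagreeing with the ground truth is lowered by one. The function name is hypothetical, and the brute-force search stands in for the Hungarian algorithm on tiny inputs only.

```python
from itertools import permutations


def loss_augmented_assignment(C, y_true):
    """Most violated assignment under the Hamming loss.

    Maximising sum_i [1{y_true[i] != y[i]} - C[i][y[i]]] equals minimising
    the cost matrix C after subtracting 1 from every entry that disagrees
    with the ground truth, so a plain assignment solver suffices.
    """
    n = len(C)
    C_aug = [[C[i][j] - (1.0 if j != y_true[i] else 0.0) for j in range(n)]
             for i in range(n)]
    # Brute force stands in for the Hungarian algorithm on tiny examples.
    return min(permutations(range(n)),
               key=lambda y: sum(C_aug[i][y[i]] for i in range(n)))
```

When the current costs separate the correct assignment from the wrong ones by more than the loss, the oracle returns the ground truth itself and the corresponding update vanishes.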

The learning step of Eq. (12) can be efficiently solved online,
under the premise that the dual formulation of LSSVM
results in a continuously differentiable convex objective after
latent completion. We designed a modified version of the
Block-Coordinate Frank-Wolfe algorithm [16], presented in
Alg. 1. The main insight here is to notice that the linear
subproblem employed by Frank-Wolfe (line 5) is
equivalent to the loss-augmented decoding subproblem of Eq. (14),
which can be solved efficiently through the Hungarian
algorithm [15]. To deal with latent variables during optimization,
we added the latent completion process (line 4) where, given
an input/output pair, the latent variable $\mathcal{Z}_k$ which best
explains the solution $\mathbf{y}_k$ given the observed data is found. Through
the latent completion step, the objective function optimized
by Frank-Wolfe is guaranteed to be convex.


8. Experimental results 

In this section we present two different experiments that
highlight the improvement of our method over state of the
art trackers in static camera sequences. The first experiment
is devoted to stressing the method in cluttered scenarios where
moderate crowding occurs and our divide and conquer approach
gives its major benefits in terms of both computational speed
and performance. The second experiment is on the publicly
available MOT Challenge dataset, which is becoming a standard
for tracking-by-detection comparison. Tests were evaluated
employing the CLEAR MOT [6] measures and trajectory
based measures (MT, ML, FRG) as suggested in [20]. All the
detections, where not provided by the authors, have been
computed using the method in [10] as suggested by the protocol
in [20]. Results are averaged per experiment in order to give
a quick overview of the tracker performance. Results on individual
sequences are provided in the additional material.
To train the parameters acting on the complex zones, the
LSSVM has been trained with ground truth (GT)
trajectories and the addition of different levels of random noise
simulating missed and false detections. In all the tests, the
locations of occluded objects are updated in time using a Kalman
filter with a constant velocity state transition model, and
discarded if not reassociated after 15 frames.
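
The handling of occluded tracks can be sketched under simplifying assumptions. Only the constant-velocity prediction of the state mean is shown, while a full Kalman filter would also propagate the state covariance and correct the state on reassociation. The class and function names are hypothetical; the 15-frame limit is the value reported above.

```python
from dataclasses import dataclass

MAX_OCCLUDED_FRAMES = 15  # value reported in the text


@dataclass
class OccludedTrack:
    x: float
    y: float
    vx: float
    vy: float
    frames_occluded: int = 0


def predict(track: OccludedTrack, dt: float = 1.0) -> OccludedTrack:
    """Constant-velocity state transition: position advances, velocity is kept."""
    return OccludedTrack(track.x + dt * track.vx, track.y + dt * track.vy,
                         track.vx, track.vy, track.frames_occluded + 1)


def step_occluded(tracks):
    """Advance all occluded tracks by one frame and drop the stale ones."""
    advanced = [predict(t) for t in tracks]
    return [t for t in advanced if t.frames_occluded <= MAX_OCCLUDED_FRAMES]
```

A track reassociated to a detection would leave this list and have its counter reset. That bookkeeping is omitted here.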

8.1. Features 

The strength of the proposal is the joint LSSVM
framework that learns to weight features for both partitioning the
scene and associating targets. On these premises, we
purposely adopted standard features. Without loss of generality,
the method can be expanded through additional and more
complex features as well. The features always refer to a
single detection $d$ and a single track $t \in \mathcal{T}$, occluded
or not, and its associated history, in compliance with Eq. (9).

Figure 5: Tracking results on PETS09-S2L3, 1shatian3 and GVEII from the MCD dataset (top row). AVG-TownCentre, ADL-Rundle-3 and
Venice-1 from the MOT Challenge sequences (bottom). Next to images, simple (green) and complex (red) zones are displayed.

In the experiments, the appearance of the targets is
modeled through a color histogram in the RGB space. Every
time a new detection is associated to a track, its appearance
information is stored in the track history. The appearance
feature $g_1$ is then computed as the average value of the
Kullback-Leibler distance of the detection histogram from the
previous instances of the track. Additionally, we designed tracks
to contain their full trajectories over time. Having
the trajectories available, we modeled the motion coherence $g_2$ of a
detection w.r.t. a track by evaluating the smoothness of the
manifold fitted on the joint set of the newly detected point and
the spatial history of the track. More precisely, given a detected
point, an approximate value of the Ricci curvature is
computed by considering only the subset of detections of the
trajectory lying inside a given neighborhood of the detected
point. An extensive presentation of this feature is in [11].
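
The appearance feature can be sketched as an average Kullback-Leibler divergence over the stored track histograms. The direction of the divergence (detection relative to track), the smoothing constant and the zero cost for an empty history are assumptions of this illustration, and the function names are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-10):
    """Kullback-Leibler divergence KL(p || q) between two normalised histograms.

    eps guards against empty bins; this smoothing is an assumption of the sketch.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def appearance_cost(det_hist, track_history):
    """Average divergence of the detection histogram from the stored track histograms."""
    if not track_history:
        return 0.0  # no history yet: neutral cost (assumption)
    return sum(kl_divergence(det_hist, h) for h in track_history) / len(track_history)
```

A detection whose color distribution matches the stored history yields a cost close to zero, while a mismatch increases it.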

8.2. Datasets and Settings 

Mildly Crowded Dataset (MCD): the dataset is a collection
of moderately crowded videos taken from public
benchmarks with the addition of ad-hoc sequences. This
dataset consists of 4 sequences: the well-known PETS09
S2L2 and S2L3 sequences, and 2 new sequences. GVEII
is characterized by a high number of pedestrians crossing
the scene (up to 107 people per frame), while 1shatian3,
captured by [27], is a sequence characterized by high
density and clutter (up to 227 people per frame). A single
training stage was performed by gathering the first 30% of
each video. These frames have not been used at test time.

MOT Challenge: the dataset consists of several publicly
available sequences in different scenarios. Detections and
annotations are provided by the MOTChallenge website. In
our test we consider the subset of the sequences coming
from fixed cameras, since distances are not meaningful in
the moving camera settings: TUD-Crossing, PETS09-S2L2,
AVG-TownCentre, ADL-Rundle-3, KITTI-16 and Venice-1.
Learning was performed on a distinct set of sequences
provided on the website for training.

8.3. Comparative evaluation 

Method                  Appearance   MOTA   MOTP   MT   ML   IDS   FRG
LDCT                    w.n.         47.7   68.8   88   26   209   103
LDCT (all features)     ✓            40.6   66.3   61   43   446   193
LDCT (only simple)      -            36.4   64.7   58   50   586   276
Bae and Yun [2]         ✓            39.0   65.8   84   35   637   289
Possegger et al. [21]   -            38.7   65.0   79   37   455   440
Milan et al. [19]       -            40.6   66.7   64   42   242   141

Table 2: Average results on MCD. In the appearance column, w.n.
is when needed. More details on the two baselines, LDCT (all features)
and LDCT (only simple), are given in the text.

Method      MOTA   MOTP   MT   ML     FP     FN   IDS   FRG
Online
LDCT        43.1   74.5    9   10    682   2780   161   187
RMOT        30.4   70.2    2   27   1011   3259    74   125
TC_ODAL     24.2   70.9    1   31   1047   3528    75   152
Offline
MotiCon     32.0   70.6    2   30    111   3280   110   105
SegTrack    32.3   72.1    3   38    520   3454    80    76
CEM         28.1   71.2    5   24   1256   3088    87    97
SMOT        23.9   71.7    2   27    706   3627   120   208
TBD         28.0   71.3    3   25   1233   3083   192   193
DP_NMS      22.7   71.4    3   17   1062   3052   529   325

Table 3: Averaged results of our method (LDCT) and the other
MOT Challenge competitors on the 6 fixed camera sequences. See
http://www.motchallenge.net for detailed results.

Figure 6: Track length curves (TL) on MCD sequences. The gray shaded area indicates the performance reached by a simple global NN
algorithm (lower bound) and the highest score obtained for each track combining the results of all the different methods (upper bound).

Results on MCD: Quantitative results of our proposal on
the MCD dataset compared with the state of the art trackers
are presented in Tab. 2, while visual results are in Fig. 5. We
compared against two very recent online methods [21, 2]
that focus either on target motion or appearance. Moreover,
the offline method [19] has been considered, being one of
the most effective MTT methods up to now. In the challenging
MCD sequences, we outperform the competitors in
terms of MT values, while also having the lowest number of IDS
and FRG. This is mainly due to the selective use of the
proper features depending on the outcomes of the divide
phase of our algorithm. This solution allows our tracker to
take the best of both worlds against [21] and [2]. The MOTA
measure is higher as well, testifying to the overall quality of
the proposed tracking scheme. Additionally, in Fig. 6 we
report the track length (TL) curves on the MCD dataset. The TL
curve is computed by considering the length of the correctly
tracked GT trajectories plotted in descending order. The plot
gives information on the ability of the tracker to build
continuous and correct tracks for all the ground truth elements
in the scene, neglecting the amount of false tracks inserted.
Our AUC is always greater than the competitors' thanks to the
adoption of complex zones that effectively deal with
occluded/disappeared objects and keep the tracks longer.
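
The track length curve described above can be sketched in a few lines. The input is assumed to be the list of correctly tracked lengths, one per ground truth trajectory. The unit spacing used for the area and the absence of any normalisation are assumptions, and the function names are hypothetical.

```python
def track_length_curve(correct_lengths):
    """Lengths of correctly tracked ground-truth trajectories, longest first."""
    return sorted(correct_lengths, reverse=True)


def area_under_curve(curve):
    """Area under the curve with unit spacing between trajectories (plain sum)."""
    return sum(curve)
```

Under these assumptions a tracker that keeps every ground truth trajectory for longer obtains a curve that lies above the others and hence a larger area.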

To evaluate the improvement due to the adoption of
the divide and conquer steps, which is the foundation of
our tracker, in Tab. 2 we also test two baselines, in which
either all features or spatial features only are used for all
the assignments independently of the zone type. In both
tests, the divide step, the parameter learning and occlusion
handling remain as previously described. The improvement
of the complete method over these baselines
suggests that complex features are indeed more
beneficial when used selectively.

Results on MOT Challenge: Tab. 3 summarizes the accuracy
of our method compared to other state of the art algorithms
on the MOT Challenge dataset. Similarly to the MCD
experiment, we observe that our algorithm outperforms the
other state of the art methods. Our method achieves the best
results in most of the metrics, keeping IDS and FRG
relatively low as well. In turn, our method records the highest
MOTA, with a significant margin (+10%) over the others.
Excellent results on this dataset highlight the generalization
ability of our method, which was trained on sequences
different from (although similar to) the ones in the test evaluation.
Fig. 5 shows some qualitative examples of our results.
Furthermore, our online tracker has been designed to run
considerably fast. We report an average performance
of 10 fps on the MOT Challenge sequences. The runtime is
strongly influenced by the number of detections as well as
by the number of tracks created up to a specific frame. This
performance is in line with or faster than the majority of
current methods, which report an average of 3-5 fps.

The computational complexity of solving Eq. (1) using
the Hungarian algorithm is $O((N + N_o)^3)$, with $N$ the number
of tracks and detections to be associated and $N_o$ the number
of occluded tracks. Since the complexity of the divide step
is linear in the number of targets, our algorithm reduces the
assignment complexity to $N\,O(\alpha^2) + O((\beta N + N_o)^3)$. The
first term applies to simple zones and is linear in $N$, being
dominated by $\alpha$, the average number of detections in
every partition ($\alpha \ll N$). The second term modulates the
complexity of the association algorithm in complex zones
by the factor $\beta$, i.e. the percentage of complex zones in the
scene. Finally, the $N_o$ term is related to the recall of the
chosen detector. As an example, $N_o$ can be realistically set
to $0.3N$ and, if the percentage of complex zones $\beta$ is 10%,
the algorithm is 50x faster than its original counterpart.

9. Conclusion 

In this work, we proposed an enhanced version of the
Hungarian online association model to match recent advances
in features and cope with the peculiarities of different sequences.
The algorithm is able to learn to effectively partition the
scene and choose the proper feature combination to solve
simple and complex associations in an online fashion. As
observed in the experiments, the benefits of our divide and
conquer approach are evident in terms of both computational
complexity of the problem and tracking accuracy.

The proposed tracking framework can be extended/enriched
with a different set of simple and complex features and it
can learn to identify the relevant ones for the specific
scenario.¹ This leaves considerable room for improvement by
allowing the community to test the method with more complex
and sophisticated features. We invite the reader to download
the code and to test it by adding her favorite features.


¹Although the analogy with cognitive theory holds for spatial features only.






Appendix A: Block-Coordinate Frank-Wolfe optimization of Latent Structural SVM

In a recent paper by Lacoste-Julien et al. [16] the efficient
use of Block-Coordinate Frank-Wolfe optimization
for the training of structural SVM was demonstrated. They
noted that, given access to a maximization oracle, subgradient
methods could be applied to solve the non-smooth
unconstrained problem of Eq. (15). The notation follows the
one used in the paper.

$$\min_{w} \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{K}\sum_{k=1}^{K} \tilde{H}_k(w), \tag{15}$$

where $\tilde{H}_k(w)$ is exactly the optimal value for the necessary
max oracle. The Lagrange dual of the above K-slack
formulation of Eq. (15) has $m = \sum_k |\mathcal{Y}_k|$ potential support
vectors. Writing $\alpha_k(y)$ for the dual variables associated with
the k-th training example and potential output $y$, the dual
problem is given by

$$\min_{\alpha \in [0,1]^m} \; f(\alpha) = \frac{\lambda}{2}\|A\alpha\|^2 - b^T\alpha \quad \text{s.t.} \quad \sum_{y \in \mathcal{Y}_k} \alpha_k(y) = 1, \;\; \forall k \in \mathbb{N}_K. \tag{16}$$

Here matrix $A$ and vector $b$ are constructed by simple Lagrangian
derivation. The only two requirements that need to
be satisfied in order to apply the Frank-Wolfe algorithm to the
problem of Eq. (16) are:

• the domain $\mathcal{M}$ of $f(\alpha)$ has to be compact, and

• the convex objective $f$ has to be continuously differentiable.

Observe that the domain $\mathcal{M}$ of Eq. (16) is the product of
$K$ probability simplices, $\mathcal{M} = \Delta_{|\mathcal{Y}_1|} \times \cdots \times \Delta_{|\mathcal{Y}_K|}$. It is
thus compact by the geometrical definition of simplex. We
now present matrix $A$ and vector $b$ for the latent formulation
of SSVM and check that $f$ is continuously differentiable.
Recalling that for LSSVM the loss augmented decoding
subproblem is expressed as

$$\tilde{H}_k(w) = \max_{y \in \mathcal{Y}_k,\, z \in \mathcal{Z}_k} H_k(y, z; w) = \max_{y \in \mathcal{Y}_k,\, z \in \mathcal{Z}_k} \Delta_k(y, z) - \langle w, \psi_k(y, z) \rangle, \tag{17}$$

and omitting the Lagrangian dual derivation as it is a simple
mathematical procedure, we obtain $A_{d \times m}$ to be composed
of a set of $m$ columns:

$$A(i, \cdot) = \left\{ \frac{\psi_k(y, z)_i}{\lambda K} \;\middle|\; \forall k \in \mathbb{N}_K,\; y \in \mathcal{Y}_k,\; z \in \mathcal{Z}_k \right\}, \tag{18}$$

where $i$ indicates the row index. Analogously, the column
vector $b_{m \times 1}$ is built as follows.

$$b = \left( \frac{\Delta_k(y, z)}{K} \right)_{k \in \mathbb{N}_K,\; y \in \mathcal{Y}_k,\; z \in \mathcal{Z}_k}. \tag{19}$$

The function $f$ is now differentiated by

$$\nabla f(\alpha) = \lambda A^T A \alpha - b = \lambda A^T w - b, \tag{20}$$

where $w = A\alpha$ is the stationarity KKT condition that has
to hold in order to make the duality strong. By substituting
the definition of $A$ and $b$ for specific values of $(k, y, z)$ we
obtain

$$\nabla f(\alpha)_{(k,y,z)} = \lambda (A^T w)_{(k,y,z)} - b_{(k,y,z)} = -\frac{1}{K}\left[\Delta_k(y, z) - \psi_k(y, z)^T w\right] = -\frac{1}{K} H_k(y, z; w), \tag{21}$$

which is the same hinge loss of Eq. (17), up to the factor $-1/K$.
So once again the intuition holds that the linear subproblem
the Frank-Wolfe algorithm has to solve is strongly connected to
the loss augmented decoding subproblem. Nevertheless, as
opposed to the non-latent case, here the hinge loss also
depends on the latent variable $z$, which makes the problem
non-convex. Thus, before computing the $H_k(y, z)$, a latent
completion step (line 4 of Alg. 1) is needed in order to ensure
that $f$ is continuously differentiable over all the domain
except for a finite number of points (sufficient condition). Once
these precautions are taken, the latent formulation reduces
to the standard SSVM case and, as such, all convergence
results also apply to the latent case.
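
To make the role of the compact simplex domain and of the linear subproblem more concrete, the following Python sketch runs plain (not block-coordinate) Frank-Wolfe on a single probability simplex. It is only a toy illustration: the quadratic objective, the function names and the step size $2/(t+2)$ are assumptions chosen for the example, and no structured oracle or latent completion is involved. The point it shows is that the linear subproblem over a simplex is solved by picking one vertex, which in the dual of Eq. (16) corresponds to the output selected by the max oracle.

```python
def frank_wolfe_simplex(grad, dim, iters=2000):
    """Minimise a smooth convex function over the probability simplex.

    grad: callable returning the gradient (list of floats) at a point.
    """
    alpha = [1.0 / dim] * dim  # feasible start: centre of the simplex
    for t in range(iters):
        g = grad(alpha)
        # Linear subproblem: min_s <s, g> over the simplex is attained at the
        # vertex whose gradient entry is smallest.
        j = min(range(dim), key=g.__getitem__)
        gamma = 2.0 / (t + 2.0)  # standard diminishing step size
        # Convex combination of the current point and the chosen vertex,
        # so the iterate never leaves the simplex.
        alpha = [(1.0 - gamma) * a for a in alpha]
        alpha[j] += gamma
    return alpha


def toy_gradient(target):
    """Gradient of the hypothetical objective 0.5 * ||alpha - target||^2."""
    return lambda alpha: [a - c for a, c in zip(alpha, target)]


target = [0.2, 0.5, 0.3]  # lies inside the simplex, so it is the minimiser
alpha = frank_wolfe_simplex(toy_gradient(target), dim=3)
```

Because every update is a convex combination, no projection step is ever needed, which is what makes the method attractive when the domain is a product of simplices.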


Appendix B: Computational complexity details

In this section we provide some details on the computational
complexity results we presented in the paper. Recall
that if the problem of data association is tackled as a
whole, the complexity of finding a minimum-cost perfect
matching amounts to

$$O\big((N + N_o)^3\big), \tag{22}$$

according to widely used Hungarian implementations, where
$N$ is the number of currently active tracks and detections
and $N_o$ is the number of occluded tracks that can still be
considered for association.

When considering our method, we have to consider two
distinct contributions to the complexity of the overall
algorithm:

• divide, which is accomplished through a correlation
clustering (CC) step on $N$ elements;

• conquer, or the application of the Hungarian algorithm to the
generated sub-problems.

The division step can be linear or even sublinear with
respect to the number of elements to be split, $N$. This can
be obtained through the many recent approximate solutions
of the CC problem (which is NP-hard to solve optimally), e.g. [17, 8]. The
conquer step has to be evaluated considering that complex
and simple zones are solved differently. Simple zones result
in independent subproblems that can be solved directly
through the Hungarian algorithm. Conversely, the complex zones are
not independent and occluded targets have to be considered
as well during the overall association subproblem.
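
The independence of simple zones can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual tracker: each zone is taken to hold equally many tracks and detections given as (x, y) points, the association cost is plain Euclidean distance (the paper learns its feature weights instead), and an exhaustive search over permutations stands in for the Hungarian algorithm so that the example needs only the standard library. All function names are hypothetical.

```python
from itertools import permutations
from math import dist


def solve_zone(tracks, detections):
    """Exhaustively find the minimum-cost one-to-one matching in one zone.

    tracks, detections: equally long lists of (x, y) positions.
    Returns a list of (track_index, detection_index) pairs.
    """
    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(len(detections))):
        cost = sum(dist(tracks[i], detections[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(enumerate(best_perm))


def solve_simple_zones(zones):
    """Solve each simple zone on its own; no zone looks at another zone."""
    return [solve_zone(tracks, dets) for tracks, dets in zones]


# Two hypothetical simple zones with 2 and 1 track/detection pairs.
zones = [
    ([(0, 0), (5, 5)], [(5, 4), (0, 1)]),
    ([(10, 10)], [(11, 10)]),
]
matches = solve_simple_zones(zones)
```

Since the zones share no data, the calls to `solve_zone` could be dispatched in parallel. The exhaustive search is only viable for very small zones; a real implementation would use a Hungarian solver.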

Let $K$ be the number of clusters created by the CC and
suppose a uniform partition of tracks and detections among
these clusters. To simplify the notation, let us call $\alpha = N/K$ the
average number of elements (tracks and detections) inside
a zone. If $N$ is the overall number of active tracks and
detections, each cluster has approximately $\alpha/2$ tracks and
$\alpha/2$ detections that need to be associated. Independently of
the zone, these two quantities must coincide for the zone
to be simple, so the complexity of solving a simple zone
is $O((\alpha/2)^3)$. Note that for large $N$, $\alpha$ will
typically be much smaller than $N$. If simple clusters are a
fraction $\bar{\beta}$ of the overall number of clusters $K$, the final
complexity of solving simple zones is $\bar{\beta} K\, O((\alpha/2)^3)$, which
reduces to $\bar{\beta} N\, O((\alpha/2)^2)$ in the worst case hypothesis. Note
that these subproblems can be solved in parallel and typically
$\alpha \ll N$ when $N$ is large. This is because $\alpha$, the number of
interfering tracks/detections, is limited by the non-maxima
suppression response of a detector.

Complex zones have to be solved altogether due to the
$N_o$ shared occluded tracks. If $\bar{\beta} K$ is the number of simple
groups, the number of complex groups is $(1 - \bar{\beta})K$, or $\beta K$
for notational convenience. Since the numbers of tracks and
detections are not equal anymore, the number of rows/columns
to consider for each group is $\alpha$. We thus obtain a complexity
of $O((\beta K \alpha)^3)$ and, due to the addition of occluded targets,
the complexity increases to $O((\beta N + N_o)^3)$. Summing up,
we obtain that the overall complexity of the conquer step is

$$(1 - \beta)\, N\, O\big((\alpha/2)^2\big) + O\big((\beta N + N_o)^3\big). \tag{23}$$

Note that when $\beta = 0$, i.e. all zones are simple, only the
first term matters; while if $\beta = 1$ then the contribution of the
first term vanishes and the second term reduces to a standard
Hungarian over all the tracks/detections as in Eq. (22).
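
The behaviour of the two expressions can be explored numerically with the small Python sketch below. It evaluates only the leading terms of Eq. (22) and Eq. (23), dropping all hidden constants, so the resulting ratios are indicative of a trend rather than of measured runtimes. The function names and the sample values of $N$, $N_o$, $\alpha$ and $\beta$ are assumptions made for the example.

```python
def full_cost(n, n_occ):
    """Leading term of Eq. (22): one Hungarian over all tracks/detections."""
    return (n + n_occ) ** 3


def divide_and_conquer_cost(n, n_occ, alpha, beta):
    """Leading terms of Eq. (23); beta is the fraction of complex zones."""
    simple_zones = (1.0 - beta) * n * (alpha / 2.0) ** 2
    complex_zones = (beta * n + n_occ) ** 3
    return simple_zones + complex_zones


def speedup(n, n_occ, alpha, beta):
    """Ratio between the two leading-term costs (constants ignored)."""
    return full_cost(n, n_occ) / divide_and_conquer_cost(n, n_occ, alpha, beta)


# Hypothetical scene: 100 tracks/detections, 30 occluded tracks, zones of 4.
mostly_simple = speedup(100, 30, alpha=4, beta=0.1)
all_complex = speedup(100, 30, alpha=4, beta=1.0)
```

With $\beta = 1$ the ratio is exactly one, matching the observation that the conquer step then degenerates to a single Hungarian over all tracks and detections.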

Appendix C: Kalman filtering 

In order to propagate track positions over occlusions, we
employed a simple Kalman Filter predictor with a constant
velocity model. This basically means that, while unobserved,
tracks keep moving under the assumption that their
velocity will not change over time. More formally, the
standard discrete Kalman Filter formulation, when no input is
considered, is:

$$\begin{aligned} x(k) &= A\,x(k-1) + w(k-1) \\ z(k) &= H\,x(k) + v(k), \end{aligned} \tag{24}$$

the first being the state equation and the second
the measurement equation. Here $x(k)$ represents the
state of a track at time $k$, while $z(k)$ is its measured position.
$H$ is the matrix which relates the state to the measurement.
$A$ is called the state transition matrix and explains how the
model should evolve over time according to its intrinsic
physical properties. $v(k)$ and $w(k-1)$ are the measurement
and state noise random variables. During occlusions, the
observation cannot be directly measured, so we need to rely
on the second relation $z(k) = H\,x(k) + v(k)$, and cannot
correct the model. A state $x$ for a track is usually represented
by a four dimensional vector containing its position and
velocity as follows:

$$x(k) = [x_x(k),\, x_y(k),\, \dot{x}_x(k),\, \dot{x}_y(k)]^T \tag{25}$$

To describe a constant velocity linear model, we need to
specify $A$ and $H$ as follows:

$$A = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \qquad H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \tag{26}$$

By substituting the equations, and ignoring the noise terms just
for the sake of simplicity, we obtain the measurement vector:

$$z(k) = [x_x(k-1) + \dot{x}_x(k-1),\; x_y(k-1) + \dot{x}_y(k-1)]^T, \tag{27}$$

which corresponds exactly to a constant velocity model due
to the identity 2x2 submatrix in the lower right corner of $A$.
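
The noise-free prediction described above can be reproduced with a few lines of plain Python. The sketch below covers only the propagation of the state through $A$ and its projection through $H$; covariance propagation and the correction step of a full Kalman Filter are deliberately left out. The function names and the sample state are assumptions made for the example.

```python
# Constant velocity transition and measurement matrices, as in Eq. (26).
A = [[1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]
H = [[1, 0, 0, 0],
     [0, 1, 0, 0]]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def predict(state, steps=1):
    """Propagate a state [px, py, vx, vy] for `steps` frames, ignoring noise."""
    for _ in range(steps):
        state = matvec(A, state)
    return state


def predicted_position(state, steps=1):
    """Noise-free measurement z = H x of the propagated state."""
    return matvec(H, predict(state, steps))


# Hypothetical track last seen at (10, 5) moving by (2, -1) pixels per frame.
occluded_for_three_frames = predicted_position([10.0, 5.0, 2.0, -1.0], steps=3)
```

Each application of $A$ adds the velocity to the position and leaves the velocity itself unchanged, which is why an occluded track keeps drifting along a straight line.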


Appendix D: Detailed experimental results 

In the paper we had to omit some details of the experimental
results. Due to space limitations we presented the results
only averaged over the whole sequence set of the Mildly
Crowded Dataset (MCD) and the fixed camera sequences
of the MOT Challenge (MOT) benchmark. In Tab. 4 and
Tab. 5 we report per sequence results. In particular, for MCD
we also report results for the considered competitors, while
for the MOT benchmark results are reported for our method
only and we let the reader refer to the benchmark site for
competitors' results:

http://motchallenge.net/results_detail. 


References 

[1] G. A. Alvarez and S. L. Franconeri. How many objects can
you track?: Evidence for a resource-limited attentive tracking
mechanism. Journal of Vision, 7(13), Oct. 2007.






Sequence             Method                 onl.  app.  MOTA  MOTP  MT   ML   IDS   FRG
---------------------------------------------------------------------------------------
PETS09-S2-L2         LDCT (our)             ✓     w.n.  47.4  70.8  6    3    297   300
42 pedestrians       LDCT (all features)    ✓     ✓     41.3  69.7  4    7    411   252
up to 33 per frame   LDCT (only simple)     ✓     -     35.7  68.8  3    9    497   323
                     Bae and Yun [2]        ✓     ✓     30.2  69.2  1    8    284   499
                     Possegger et al. [21]  ✓     -     40.0  68.6  8    3    211   342
                     Milan et al. [19]      -     -     44.9  70.2  5    6    150   165
---------------------------------------------------------------------------------------
PETS09-S2-L3         LDCT (our)             ✓     w.n.  35.2  66.7  6    15   120   12
44 pedestrians       LDCT (all features)    ✓     ✓     30.6  65.1  1    20   235   45
up to 42 per frame   LDCT (only simple)     ✓     -     26.1  63.2  1    25   316   62
                     Bae and Yun [2]        ✓     ✓     28.8  62.3  8    17   96    150
                     Possegger et al. [21]  ✓     -     32.2  64.1  5    12   79    111
                     Milan et al. [19]      -     -     31.3  64.6  7    23   71    56
---------------------------------------------------------------------------------------
GVEII                LDCT (our)             ✓     w.n.  65.6  73.5  208  63   285   71
630 pedestrians      LDCT (all features)    ✓     ✓     55.6  70.5  172  101  548   320
up to 107 per frame  LDCT (only simple)     ✓     -     50.9  67.9  151  113  753   418
                     Bae and Yun [2]        ✓     ✓     57.9  71.1  200  75   1023  320
                     Possegger et al. [21]  ✓     -     51.1  69.8  153  98   844   652
                     Milan et al. [19]      -     -     49.3  71.2  147  87   312   244
---------------------------------------------------------------------------------------
lshatian3            LDCT (our)             ✓     w.n.  42.6  61.0  133  23   137   32
239 pedestrians      LDCT (all features)    ✓     ✓     34.7  59.9  68   45   592   154
up to 227 per frame  LDCT (only simple)     ✓     -     32.8  58.9  79   52   776   301
                     Bae and Yun [2]        ✓     ✓     38.9  60.7  150  40   1146  185
                     Possegger et al. [21]  ✓     -     31.5  57.4  110  35   686   654
                     Milan et al. [19]      -     -     36.8  60.8  98   50   435   98


Table 4: Comparison of the proposed method (dark grey) with the state of the art methods on the MCD dataset. In the appearance column, 
w.n. means when needed. For each sequence, we also run our code by always associating based on the whole feature set and simple features 
only (light grey baselines). 

Table 5: Per sequence results of our method on the MOT Challenge fixed camera sequences. The last row contains mean values and is the
one reported in the paper for comparison. Refer to the benchmark website (http://motchallenge.net/results_detail) for
competitors' detailed results.